Abstract

In modern scientific research, massive datasets with huge numbers of observations
are frequently encountered. To facilitate the computational process, a
divide-and-conquer scheme is often used for the analysis of big data. In such a
strategy, a full dataset is first split into several manageable segments; the final
output is then averaged from the individual outputs of the segments. Despite its
popularity in practice, it remains largely unknown whether such a distributive
strategy provides valid theoretical inferences to the original data. In this paper,
we address this fundamental issue for the distributed kernel regression (DKR), where
the algorithmic feasibility is measured by the generalization performance of the
resulting estimator. To justify DKR, a uniform convergence rate is needed for
bounding the generalization error over the individual outputs, which brings new and
challenging issues in the big data setup. Under mild conditions, we show that, with a
proper number of segments, DKR leads to an estimator that is generalization
consistent to the unknown regression function. The obtained results justify the
method of DKR and shed light on the feasibility of using other distributed algorithms
for processing big data. The promising performance of the method is supported by both
simulation and real data examples.

Keywords: Distributed Algorithm, Kernel Regression, Big Data, Learning Theory,
Generalization Bounds.


1 Introduction 

The rapid development in data generation and acquisition has made a profound
impact on knowledge discovery. Collecting data with unprecedented sizes and
complexities is now feasible in many scientific fields. For example, a satellite takes
thousands of high resolution images per day; a Walmart store has millions of
transactions per week; and Facebook generates billions of posts per month. Such
examples also occur in agriculture, geology, finance, marketing, bioinformatics, and
Internet studies, among others. The appearance of big data brings great opportunities
for extracting new information and discovering subtle patterns. Meanwhile, their huge
volume also poses many challenging issues to the traditional data analysis, where
a dataset is typically processed on a single machine. In particular, some severe
challenges come from the computational aspect, where the storage bottleneck and
algorithmic feasibility need to be faced. Designing effective and efficient analytic
tools for big data has been a recent focus in the statistics and machine learning
communities [21].

In the literature, several strategies have been proposed for processing big data.
To overcome the storage bottleneck, the Hadoop system was developed to conduct
distributive storage and parallel processing. The idea of Hadoop follows from a
natural divide-and-conquer framework, where a large problem is divided into several
manageable subproblems and the final output is obtained by combining the
corresponding sub-outputs. With the aid of Hadoop, many machine learning methods
can be rebuilt into distributed versions for big data analysis. For example,
McDonald et al. considered a distributed training approach for the structured
perceptron, while Kleiner et al. [10] introduced a distributed bootstrap method.
Recently, similar ideas have also been applied to statistical point estimation [11],
kernel ridge regression [28], matrix factorization [10], and principal component
analysis [25].

To better understand the divide-and-conquer strategy, let us consider an
illustrative example as follows. Suppose that a dataset consists of $N = 1{,}000{,}000$
random samples $\{(x_i, y_i)\}_{i=1}^N \subset \mathbb{R}^d \times \mathbb{R}$ with
dimension $d = 100$. We assume that the data follow a linear model
$y_i = x_i^T \beta + \epsilon$ with a random noise $\epsilon$. The goal of learning is to
estimate the regression coefficient $\beta$. Let $Y = (y_1, \dots, y_N)^T$ be the
$N$-dimensional response vector and $X = (x_1, \dots, x_N)^T$ be the $N \times d$
covariate matrix. Apparently, the huge sample size of this problem makes the
single-machine-based least squares estimate $\hat\beta = (X^T X)^{-1} X^T Y$
computationally costly. Instead, one may first evenly distribute the $N$ samples to
$m$ local machines and obtain $m$ sub-estimates $\hat\beta_j$ based on $m$ independent
runs. The final estimate of $\beta$ can then be obtained by averaging the $m$
sub-estimates, $\bar\beta = \frac{1}{m}\sum_{j=1}^m \hat\beta_j$. Compared with the
traditional method, such a distributive learning framework utilizes the computing
power of multiple machines, which avoids the direct storage and operation on the
original full dataset. We further illustrate this framework in Figure 1 and refer to
it as a distributed algorithm.
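
The averaging scheme of this example can be sketched in a few lines of Python. The
sketch is only illustrative and rests on several assumptions: it uses much smaller
sizes (N = 3000, d = 3, m = 10) than the example in the text, the helper names
`solve`, `ols`, and `distributed_ols` are hypothetical and not taken from any library,
and contiguous blocks stand in for the random split because the simulated samples are
already independent and identically distributed.

```python
import random


def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= factor * M[c][k]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(X, y):
    """Least squares estimate from the normal equations (X^T X) beta = X^T y."""
    d = len(X[0])
    XtX = [[sum(x[a] * x[b] for x in X) for b in range(d)] for a in range(d)]
    Xty = [sum(x[a] * yi for x, yi in zip(X, y)) for a in range(d)]
    return solve(XtX, Xty)


def distributed_ols(X, y, m):
    """Fit least squares on m equal blocks and average the m sub-estimates."""
    n = len(X) // m
    d = len(X[0])
    local = [ols(X[j * n:(j + 1) * n], y[j * n:(j + 1) * n]) for j in range(m)]
    return [sum(b[a] for b in local) / m for a in range(d)]


# Toy data from the linear model y = x^T beta + eps (sizes are assumptions).
rng = random.Random(0)
d, N, m = 3, 3000, 10
beta = [rng.random() for _ in range(d)]
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
y = [sum(xa * ba for xa, ba in zip(x, beta)) + rng.gauss(0, 1) for x in X]

beta_hat = ols(X, y)                 # single-machine estimate
beta_bar = distributed_ols(X, y, m)  # averaged estimate over m blocks
```

Each call to `ols` inside `distributed_ols` touches only n = N/m samples, which is the
point of the scheme: no step needs the full dataset in one place.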

The distributed algorithm provides a computationally viable route for learning
with big data. However, it remains largely unknown whether such a divide-and-conquer
scheme indeed provides valid theoretical inferences to the original data. For point
estimation, Li et al. [11] showed that the distributed moment estimation is
consistent if an unbiased estimate is obtained for each of the sub-problems. For
kernel ridge regression, Zhang et al. [28] showed that, with appropriate tuning
parameters, the distributed algorithm does lead to a valid estimation. To provide
some insights on the feasibility issue, we numerically compare the estimation accuracy
of $\bar\beta$ with that of $\hat\beta$ in the aforementioned example. Specifically, we
generate $x_i$ independently from $N(0, I_{d \times d})$ and set $\beta$ based on $d$
independent observations from $U[0,1]$. The value of $y_i$ is generated from the
presumed linear model with $\epsilon \sim N(0,1)$. We then randomly distribute the full
data to $m \in [2^0, 2^{15}]$ local machines and output $\bar\beta$ based on $m$ local
ridge estimates $\hat\beta_j$ for $j = 1, \dots, m$. In Figure 2, we plot the estimation
errors versus the number of local machines $m$ based on three types of estimators:
$e_1 = \|\beta - \hat\beta\|_2^2$, $e_2 = \|\beta - \bar\beta\|_2^2$, and
$e_3 = \min_j \|\beta - \hat\beta_j\|_2^2$. For a wide range of $m$, the distributed
estimator $\bar\beta$ appears to be about as accurate as the traditional $\hat\beta$.
However, this no longer holds when $m$ is overly large. This observation raises an
interesting but fundamental question for using the distributed algorithm in
regression: under what conditions does the distributed estimator provide an effective
estimation of the target function? In this paper, we aim to find an answer to this
question and provide more general theoretical support for the distributed regression.

Figure 1: A divide-and-conquer learning framework.

Figure 2: Estimation errors for the distributed regression.

Under the kernel-based regression setup, we propose to take the generalization
consistency as a criterion for measuring the feasibility of the distributed
algorithms. That is, we regard an algorithm as theoretically feasible if its
generalization error tends to zero as the number of observations $N$ goes to infinity.
To justify the distributed regression, a uniform convergence rate is needed for
bounding the generalization error over the $m$ sub-estimators. This brings new and
challenging issues in analysis for the big data setup. Under mild conditions, we show
that the distributed kernel regression (DKR) is feasible when the number of its
distributed sub-problems is moderate. Our result is applicable to many commonly used
regression models, which incorporate a variety of loss, kernel, and penalty functions.
Moreover, the feasibility of DKR does not rely on any parametric assumption on
the true model. It therefore provides a basic and general understanding of the
distributed regression analysis. We demonstrate the promising performance of DKR
via both simulation and real data examples.

The rest of the paper is organized as follows. In Section 2, we introduce the model
setup and formulate the DKR algorithm. In Section 3, we establish the generalization
consistency and justify the feasibility of DKR. In Section 4, we show numerical
examples to support the good performance of DKR. Finally, we conclude the paper
in Section 5 with some useful remarks.


2 Distributed Kernel Regression 

2.1 Notations 

Let $Y \in [-M, M] \subset \mathbb{R}$ be a response variable bounded by some $M > 0$ and
$X \in \mathcal{X} \subset \mathbb{R}^d$ be its $d$-dimensional covariate drawn from a
compact set $\mathcal{X}$. Suppose that $Z = (Y, X)$ follows a fixed but unknown
distribution $\rho$ with its support fully filled on
$\mathcal{Z} = [-M, M] \times \mathcal{X}$. Let $S = \{z_i = (y_i, x_i), i = 1, \dots, N\}$
be $N$ independent observations collected from $Z$. The goal of study is to estimate the
potential relationship $f^* : \mathcal{X} \to \mathbb{R}$ between $X$ and $Y$ through
analyzing $S$.

Let $\ell(\cdot)$ be a nonnegative loss function and $f$ be an arbitrary mapping from
$\mathcal{X}$ to $\mathbb{R}$. We use

$$
\mathcal{E}(f) = E_z[\ell(f, z)] = \int_{\mathcal{Z}} \ell(f, z)\, d\rho
$$

to denote the expected risk of $f$. The minimizer $f_\rho = \arg\min \mathcal{E}(f)$ is
called the regression function, which is an oracle estimate under $\ell$ and thus serves
as a benchmark for other estimators. Since $\rho$ is unknown, $f_\rho$ is only
conceptual. Practically, it is common to estimate $f^*$ through minimizing a
regularized empirical risk

$$
\min_{f \in \mathcal{F}} \{\mathcal{E}_S(f) + \lambda \|f\|\}, \tag{1}
$$

where $\mathcal{F}$ is a user-specified hypothesis space,
$\mathcal{E}_S(f) = \sum_{i=1}^N \ell(f, z_i)/N$ is the empirical risk, $\|\cdot\|$ is a
norm in $\mathcal{F}$, and $\lambda > 0$ is a regularization parameter.

Framework (1) covers a broad range of regression methods. In the machine
learning community, it is popular to set $\mathcal{F}$ by a reproducing kernel Hilbert
space (RKHS). Specifically, let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a
continuous, symmetric, and positive semi-definite kernel function. The RKHS
$\mathcal{H}_K = \mathrm{span}\{K(x, \cdot), x \in \mathcal{X}\}$ is a Hilbert space of
$L_2$-integrable functions induced by $K$. For any $f = \sum_i \alpha_i K(u_i, \cdot)$
and $g = \sum_j \beta_j K(v_j, \cdot)$, their inner product is defined by

$$
\langle f, g \rangle_K = \sum_{i,j} \alpha_i \beta_j K(u_i, v_j)
$$

and the kernel $L_2$ norm is given by $\|f\|_K^2 = \langle f, f \rangle_K$. It is easy to
verify that

$$
f(x) = \langle f, K(x, \cdot) \rangle_K \tag{2}
$$

for any $f \in \mathcal{H}_K$. Therefore, $K$ is a reproducing kernel of $\mathcal{H}_K$.
Readers may refer to the cited references on RKHS for more detailed discussions.

Let $C(\mathcal{X})$ denote the space of continuous functions on $\mathcal{X}$. It is
known that $\mathcal{H}_K$ is dense in $C(\mathcal{X})$ with appropriate choices of $K$
[15]. This property makes $\mathcal{H}_K$ a highly flexible space to estimate an
arbitrary $f^* \in C(\mathcal{X})$. In this paper, we follow framework (1) with
$\mathcal{F} = \mathcal{H}_K$ and $\|\cdot\| = \|\cdot\|_K^p$ for some $p > 0$.

2.2 The DKR Algorithm 

We now consider (1) in the big data setup. In particular, we assume that sample $S$
is too big to be processed on a single machine and thus we need to use its distributed
version. Suppose $S$ is evenly and randomly assigned to $m$ local machines, with each
machine processing $n = N/m$ samples. We denote by $S_j$, $j = 1, 2, \dots, m$, the
sample segment assigned to the $j$th machine. The global estimator is then constructed
by taking the average of the $m$ local estimators. Specifically, by setting
$\mathcal{F} = \mathcal{H}_K$ in (1), this strategy leads to the distributed kernel
regression (DKR), which is described as Algorithm 1.

Algorithm 1 The DKR Algorithm
Input: $S$, $K$, $\lambda$, $m$
Output: $\bar f$

1: Randomly split $S$ into $m$ sub-samples $S_1, \dots, S_m$ and store them separately
   on $m$ local machines.

2: Let $T_M[\cdot]$ be a truncation operator with a cutoff threshold $M$. For
   $j = 1, 2, \dots, m$, find a local estimator based on $S_j$ by

   $$\hat f_j = T_M[f_j],$$

   where

   $$f_j = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{n} \sum_{z_i \in S_j} \ell(f, z_i) + \lambda \|f\|_K^p \Big\}.$$

3: Combine the $\hat f_j$s to get a global estimator

   $$\bar f = \frac{1}{m} \sum_{j=1}^m \hat f_j.$$

By the representer theorem, $f_j$ in step 2 of DKR can be constructed from
$\mathrm{span}\{K(x_i, \cdot), x_i \in S_j\}$. This allows DKR to be practically carried
out within finite $n$-dimensional subspaces. The distributive framework of DKR enables
parallel processing and thus is appealing for the analysis of big data. With $m = 1$,
DKR reduces to the regular kernel-based learning, which has received a great deal of
attention in the literature [18] [22]. With quadratic $\ell$ and $p = 2$, Zhang et
al. [28] conducted a feasibility analysis for DKR with $m > 1$. Unfortunately, their
results are built upon the closed-form solution of $f_j$ and thus are not applicable to
other DKR cases. In this work, we attempt to provide a more general feasibility
result for using DKR in big data.
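
As a rough illustration of Algorithm 1, the following Python sketch implements DKR
for one particular case only: the quadratic loss with $p = 2$ and a Gaussian kernel
$K(u, v) = \exp(-\|u - v\|_2^2/\tau^2)$. In this case the local minimizer has the
closed form $f_j = \sum_i \alpha_i K(x_i, \cdot)$ with
$\alpha = (\mathbf{K} + n\lambda I)^{-1} y$, where $\mathbf{K}$ is the kernel matrix on
$S_j$. All function names, the toy data, and the parameter values below are
assumptions made for illustration; other losses and penalties would need an iterative
solver in place of `fit_local`.

```python
import math
import random


def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= factor * M[c][k]
    return [M[i][n] / M[i][i] for i in range(n)]


def gaussian_kernel(u, v, tau):
    """Gaussian kernel exp(-||u - v||^2 / tau^2)."""
    return math.exp(-sum((a - c) ** 2 for a, c in zip(u, v)) / tau ** 2)


def fit_local(X, y, lam, tau):
    """Local kernel ridge fit: alpha = (K + n * lam * I)^{-1} y (quadratic loss, p = 2)."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[k], tau) + (n * lam if i == k else 0.0)
          for k in range(n)] for i in range(n)]
    alpha = solve(K, y)
    return lambda x: sum(a * gaussian_kernel(xi, x, tau) for a, xi in zip(alpha, X))


def truncate(value, M):
    """Truncation operator T_M: clip a prediction to [-M, M]."""
    return max(-M, min(M, value))


def dkr(X, y, m, lam, tau, M, seed=0):
    """Steps 1-3 of Algorithm 1 for the kernel ridge special case."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)            # step 1: random split into m segments
    n = len(X) // m
    local = []
    for j in range(m):                          # step 2: one local fit per segment
        part = idx[j * n:(j + 1) * n]
        local.append(fit_local([X[i] for i in part], [y[i] for i in part], lam, tau))
    # step 3: average the truncated local estimators
    return lambda x: sum(truncate(f(x), M) for f in local) / m


# Toy one-dimensional problem (an assumption, not the simulation of Section 4).
rng = random.Random(1)
X = [[rng.random()] for _ in range(200)]
y = [math.sin(2 * math.pi * x[0]) + rng.gauss(0, 0.1) for x in X]
f_bar = dkr(X, y, m=4, lam=1e-3, tau=0.3, M=1.0)
```

Each local fit solves an $n \times n$ system instead of an $N \times N$ one, which is
where the computational saving of the distributed version comes from in this special
case.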

3 Consistency of DKR 

3.1 Preliminaries and Assumptions 

In regression analysis, a good estimator of $f^*$ is expected not only to fit the
training set $S$ but also to predict future samples from $Z$. In the machine learning
community, such an ability is often referred to as the generalization capability.
Recall that $f_\rho$ is a conceptual oracle estimator, which enjoys the lowest
generalization risk in a given loss. The goodness of $\bar f$ can be typically
measured by

$$
\mathcal{E}(\bar f) - \mathcal{E}(f_\rho) = E_z[\ell(\bar f, z) - \ell(f_\rho, z)]. \tag{3}
$$

A feasible (consistent) $\bar f$ is then required to have generalization error (3)
converge to zero as $N \to \infty$. When the quadratic loss is used, the convergence of
(3) also leads to the convergence of $\|\bar f - f_\rho\|_2$, which corresponds to the
traditional notion of consistency in statistics.

When $\ell$ is convex, Jensen's inequality implies that

$$
\mathcal{E}(\bar f) - \mathcal{E}(f_\rho) \le \frac{1}{m} \sum_{j=1}^m [\mathcal{E}(\hat f_j) - \mathcal{E}(f_\rho)].
$$

Therefore, the consistency of $\bar f$ is implied by the uniform consistency of the $m$
local estimators $\hat f_j$ for $j = 1, \dots, m$. Under appropriate conditions, this
result may be straightforward in the fixed $m$ setup. However, for analyzing big data,
it is particularly desirable to have $m$ associated with the sample size $N$. This is
because the number of machines needed in an analysis is usually determined by the scale
of that problem. The larger a dataset is, the more machines are needed. This in turn
suggests that, in asymptotic analysis, $m$ may diverge to infinity as $N$ increases.
This liberal requirement on $m$ poses new and challenging issues for justifying
$\bar f$ under the big data setup.

Clearly, the effectiveness of a learning method relies on the prior assumptions
on $f_\rho$ as well as the choice of $\ell$. For the convenience of discussion, we assess
the performance of DKR under the following conditions.

A1 $f_\rho \in C(\mathcal{X})$ and $\|f_\rho\|_\infty \le M$, where $\|\cdot\|_\infty$
denotes the function supremum norm.

A2 The loss function $\ell$ is convex and nonnegative. For any
$f_1, f_2 \in C(\mathcal{X})$ and $z \in \mathcal{Z}$, there exists a constant $L$ such
that

$$
|\ell(f_1, z) - \ell(f_2, z)| \le L \|f_1 - f_2\|_\infty.
$$

A3 For any $\omega > 0$ and $g \in C(\mathcal{X})$, there exists an
$f \in \mathcal{H}_K$ such that $\|f - g\|_\infty < \omega$. Moreover, let
$\mathcal{B}_R = \{f \in \mathcal{H}_K, \|f\|_\infty \le R\}$ for some $R > 0$. There
exist constants $C_0, s > 0$ such that

$$
\log \mathcal{N}_\infty(\mathcal{B}_1, \gamma) \le C_0 \gamma^{-s},
$$

where $\mathcal{N}_\infty(\mathcal{F}, \gamma)$ denotes the covering number of a set
$\mathcal{F}$ by balls of radius $\gamma$ with respect to $\|\cdot\|_\infty$.

Condition A1 is a regularity assumption on $f_\rho$, which can be trivial in
applications. For the quadratic loss, we have $f_\rho(X) = E(Y \mid X)$ and thus A1
holds naturally with $Y \in [-M, M]$. Condition A2 requires that $\ell(f, z)$ is
Lipschitz continuous in $f$. It is satisfied by many commonly used loss functions for
regression analysis. Condition A3 corresponds to the notion of universal kernel in
[15], which implies that $\mathcal{H}_K$ is dense in $C(\mathcal{X})$. It therefore
serves as a prerequisite for estimating an arbitrary $f^* \in C(\mathcal{X})$ from
$\mathcal{H}_K$. A3 also requires that the unit subspace $\mathcal{B}_1$ of
$\mathcal{H}_K$ has a polynomial complexity. Under our setup, a broad range of kernels
$K$ satisfies this condition, including the popular Gaussian kernel as a special case
[29] [30].

3.2 Generalization Analysis 

To justify DKR, we decompose (3) by

$$
\begin{aligned}
\mathcal{E}(\bar f) - \mathcal{E}(f_\rho)
 ={}& \underbrace{\mathcal{E}_S(f) - \mathcal{E}(f) + \mathcal{E}(\bar f) - \mathcal{E}_S(\bar f)}_{\text{sample error}} && (4) \\
 &+ \underbrace{\mathcal{E}_S(\bar f) - \mathcal{E}_S(f)}_{\text{hypothesis error}} && (5) \\
 &+ \underbrace{\mathcal{E}(f) - \mathcal{E}(f_\rho)}_{\text{approximation error}}, && (6)
\end{aligned}
$$

where $f$ is an arbitrary element of $\mathcal{H}_K$. The consistency of $\bar f$ is
implied if (3) has convergent sub-errors in (4)-(6). Since $f \in \mathcal{H}_K$ is
arbitrary, (6) measures how closely the oracle $f_\rho$ can be approximated from the
candidate space $\mathcal{H}_K$. This is a term that purely reflects the prior
assumptions on a learning problem. Under Conditions A1-A3, with an $f$ such that
$\|f - f_\rho\|_\infty \le N^{-1}$, (6) is naturally bounded by $L/N$. We therefore carry
on our justification by bounding the sample and hypothesis errors.

3.2.1 Sample Error Bound 

Let us first work on the sample error (4), which describes the difference between
the expected loss and the empirical loss for an estimator. For the convenience of
analysis, let us rewrite (4) as

$$
\mathcal{E}_S(f) - \mathcal{E}(f) + \mathcal{E}(\bar f) - \mathcal{E}_S(\bar f)
 = \Big\{ \frac{1}{N} \sum_{i=1}^N \xi_1(z_i) - E_z(\xi_1) \Big\}
 + \Big\{ E_z(\xi_2) - \frac{1}{N} \sum_{i=1}^N \xi_2(z_i) \Big\}, \tag{7}
$$

where $\xi_1(z) = \ell(f, z) - \ell(f_\rho, z)$ and
$\xi_2(z) = \ell(\bar f, z) - \ell(f_\rho, z)$. It should be noted that the randomness of
$\xi_1$ is purely from $Z$, which makes $E_z(\xi_1)$ a fixed quantity and
$\sum_{i=1}^N \xi_1(z_i)/N$ a sample mean of independent observations. For $\xi_2$,
since $\bar f$ is an output of $S$, $E_z(\xi_2)$ is random in $S$ and the $\xi_2(z_i)$s
are dependent on each other. We derive a probability bound for the sample error through
investigating (7).

To facilitate our proofs, we first state the one-sided Bernstein inequality as the
following lemma.

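Lemma 1. Let $y_1, \dots, y_N$ be $N$ independently and identically distributed random
variables with $E(y_i) = \mu$ and $\mathrm{var}(y_i) = \sigma^2$. If
$|y_i - \mu| \le T$ for some $T > 0$, then for any $\epsilon > 0$,

$$
P\Big\{ \frac{1}{N} \sum_{i=1}^N y_i - \mu \ge \epsilon \Big\}
 \le \exp\Big\{ -\frac{N \epsilon^2}{2(\sigma^2 + T\epsilon/3)} \Big\}.
$$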

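The probability bounds for the two terms of (7) are given respectively in the
following propositions.

Proposition 1. Suppose that Conditions A1-A2 are satisfied. For any $0 < \delta < 1$
and $f \in \mathcal{H}_K$, we have

$$
P\Big\{ \frac{1}{N} \sum_{i=1}^N \xi_1(z_i) - E_z(\xi_1)
 \le 2L \|f - f_\rho\|_\infty \Big( \frac{\log(1/\delta)}{N} + \sqrt{\frac{\log(1/\delta)}{N}} \Big) \Big\}
 \ge 1 - \delta.
$$

Proof. Let $f$ be an arbitrary function in $\mathcal{H}_K$. By Condition A2, we have

$$
|\xi_1(z)| = |\ell(f, z) - \ell(f_\rho, z)| \le L \|f - f_\rho\|_\infty
$$

for some constant $L > 0$. This implies that
$\mathrm{var}(\xi_1) \le L^2 \|f - f_\rho\|_\infty^2$ and
$|\xi_1 - E_z(\xi_1)| \le 2L \|f - f_\rho\|_\infty$. By Lemma 1, we have

$$
P\Big\{ \frac{1}{N} \sum_{i=1}^N \xi_1(z_i) - E_z(\xi_1) \ge \epsilon \Big\}
 \le \exp\Big\{ -\frac{N \epsilon^2}{2\big(L^2 \|f - f_\rho\|_\infty^2 + \tfrac{2}{3} L \|f - f_\rho\|_\infty \epsilon\big)} \Big\} \tag{8}
$$

for any $\epsilon > 0$. Denoting the right hand side of (8) by $\delta$, we have

$$
N \epsilon^2 + \tfrac{4}{3} L \|f - f_\rho\|_\infty \log\delta \cdot \epsilon + 2 L^2 \|f - f_\rho\|_\infty^2 \log\delta = 0. \tag{9}
$$

The positive root of (9) is given by

$$
\begin{aligned}
\epsilon^* &= \frac{\tfrac{4}{3} L \|f - f_\rho\|_\infty \log(1/\delta) + L \|f - f_\rho\|_\infty \sqrt{\tfrac{16}{9} \log^2(1/\delta) + 8N \log(1/\delta)}}{2N} \\
 &\le L \|f - f_\rho\|_\infty \Big( \frac{4 \log(1/\delta)}{3N} + \sqrt{\frac{2 \log(1/\delta)}{N}} \Big) \\
 &\le 2L \|f - f_\rho\|_\infty \Big( \frac{\log(1/\delta)}{N} + \sqrt{\frac{\log(1/\delta)}{N}} \Big).
\end{aligned} \tag{10}
$$

The proposition is proved by setting $\epsilon = \epsilon^*$ in (8). □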
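
The algebra leading from the Bernstein tail to Proposition 1 can be checked
numerically. The short Python sketch below writes $a$ for $L\|f - f_\rho\|_\infty$,
computes the positive root $\epsilon^*$ of
$N\epsilon^2 + \tfrac{4}{3} a \log\delta \cdot \epsilon + 2a^2 \log\delta = 0$, and
compares it with the bound $2a(\log(1/\delta)/N + \sqrt{\log(1/\delta)/N})$. The
function names and the values of $a$, $N$, and $\delta$ are hypothetical and chosen
only for illustration.

```python
import math


def epsilon_star(a, N, delta):
    """Positive root of N*e^2 + (4/3)*a*log(delta)*e + 2*a^2*log(delta) = 0."""
    t = math.log(1.0 / delta)
    return ((4.0 / 3.0) * a * t
            + a * math.sqrt((16.0 / 9.0) * t ** 2 + 8.0 * N * t)) / (2.0 * N)


def bernstein_tail(a, N, eps):
    """Tail bound exp(-N*eps^2 / (2*(a^2 + (2/3)*a*eps))) with variance proxy a^2."""
    return math.exp(-N * eps ** 2 / (2.0 * (a ** 2 + (2.0 / 3.0) * a * eps)))


def proposition1_bound(a, N, delta):
    """Simplified bound 2*a*(log(1/delta)/N + sqrt(log(1/delta)/N))."""
    t = math.log(1.0 / delta)
    return 2.0 * a * (t / N + math.sqrt(t / N))


# Illustrative values (assumptions).
a, N, delta = 1.5, 1000, 0.05
eps = epsilon_star(a, N, delta)
tail_at_root = bernstein_tail(a, N, eps)      # equals delta up to rounding
slack = proposition1_bound(a, N, delta) - eps  # nonnegative: the bound is looser
```

The simplified bound is looser than the exact root, which is the price paid for a
closed form that is easy to combine with the other error terms.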
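Proposition 2. Suppose that Conditions A1-A3 are satisfied. For any $0 < \delta < 1$
and $f \in \mathcal{H}_K$, we have

$$
P\Big\{ E_z(\xi_2) - \frac{1}{N} \sum_{i=1}^N \xi_2(z_i)
 \le 12ML \Big( \frac{V(N, \delta)}{N} + \sqrt{\frac{V(N, \delta)}{N}} \Big) + N^{-\frac{1}{s+2}} \Big\}
 \ge 1 - \delta,
$$

where $V(N, \delta) = C_0 (8LMN^{\frac{1}{s+2}})^s - \log\delta$.

Proof. Let $\mathcal{D}_M = \{f \in C(\mathcal{X}), \|f\|_\infty \le M\}$. Under
Condition A3, $\mathcal{B}_{2M} \subset \mathcal{H}_K$ is dense in $\mathcal{D}_M$.
Therefore, for any $\epsilon > 0$, there exists a $g_\epsilon \in \mathcal{B}_{2M}$ such
that $\|\bar f - g_\epsilon\|_\infty < \epsilon$. By A2, we further have

$$
|\ell(\bar f, z) - \ell(g_\epsilon, z)| \le L\epsilon.
$$

Consequently,

$$
\begin{aligned}
E_z(\xi_2) - \frac{1}{N} \sum_{i=1}^N \xi_2(z_i)
 &= \mathcal{E}(\bar f) - \mathcal{E}(f_\rho) - [\mathcal{E}_S(\bar f) - \mathcal{E}_S(f_\rho)] \\
 &\le \mathcal{E}(g_\epsilon) - \mathcal{E}(f_\rho) - [\mathcal{E}_S(g_\epsilon) - \mathcal{E}_S(f_\rho)] + 2L\epsilon.
\end{aligned} \tag{11}
$$

Let $U_\gamma \subset \mathcal{B}_{2M}$ be a cover of $\mathcal{B}_{2M}$ by balls of
radius $\gamma$ with respect to $\|\cdot\|_\infty$. With $\epsilon \to 0$, (11) implies
that

$$
\begin{aligned}
P\Big\{ E_z(\xi_2) - \frac{1}{N} \sum_{i=1}^N \xi_2(z_i) > \epsilon \Big\}
 &\le P\Big\{ \sup_{g \in \mathcal{B}_{2M}} \mathcal{E}(g) - \mathcal{E}(f_\rho) - [\mathcal{E}_S(g) - \mathcal{E}_S(f_\rho)] > \epsilon \Big\} \\
 &\le P\Big\{ \sup_{g \in U_\gamma} \mathcal{E}(g) - \mathcal{E}(f_\rho) - [\mathcal{E}_S(g) - \mathcal{E}_S(f_\rho)] > \epsilon - 2L\gamma \Big\} \\
 &\le \mathcal{N}_\infty(\mathcal{B}_{2M}, \gamma) \max_{g \in U_\gamma} P\big\{ \mathcal{E}(g) - \mathcal{E}(f_\rho) - [\mathcal{E}_S(g) - \mathcal{E}_S(f_\rho)] > \epsilon - 2L\gamma \big\} \\
 &\le \mathcal{N}_\infty(\mathcal{B}_{2M}, \gamma) \exp\Big\{ -\frac{N(\epsilon - 2L\gamma)^2}{2[9L^2M^2 + 2(\epsilon - 2L\gamma)LM]} \Big\},
\end{aligned} \tag{12}
$$

where the last inequality follows from Lemma 1. By A3, we have

$$
\mathcal{N}_\infty(\mathcal{B}_{2M}, \gamma) = \mathcal{N}_\infty(\mathcal{B}_1, \gamma/2M) \le \exp\{C_0 (2M/\gamma)^s\}. \tag{13}
$$

Let $\gamma = \epsilon/4L$. Inequality (12) together with (13) further implies that

$$
P\Big\{ E_z(\xi_2) - \frac{1}{N} \sum_{i=1}^N \xi_2(z_i) > \epsilon \Big\}
 \le \exp\Big\{ C_0 \Big(\frac{8LM}{\epsilon}\Big)^s - \frac{N\epsilon^2}{72L^2M^2 + 8\epsilon LM} \Big\}. \tag{14}
$$

When $\epsilon \ge N^{-\tau}$ for some $\tau > 0$, (14) implies that

$$
P\Big\{ E_z(\xi_2) - \frac{1}{N} \sum_{i=1}^N \xi_2(z_i) > \epsilon \Big\}
 \le \exp\Big\{ C_0 (8LMN^{\tau})^s - \frac{N\epsilon^2}{72L^2M^2 + 8\epsilon LM} \Big\}. \tag{15}
$$

Denote the right hand side of (15) by $\delta$. Following similar arguments to those in
(9)-(10), we have

$$
P\Big\{ E_z(\xi_2) - \frac{1}{N} \sum_{i=1}^N \xi_2(z_i)
 > ML\, \frac{8V(N, \delta) + 6\sqrt{2V(N, \delta)N}}{N} + N^{-\tau} \Big\} \le \delta, \tag{16}
$$

where $V(N, \delta) = C_0 (8LMN^{\tau})^s - \log\delta$. The proposition is proved by
setting $\tau = 1/(s+2)$, which minimizes the bound order in (16). □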


Based on Propositions 1 and 2, decomposition (7) implies directly the following
probability bound of the sample error.

Theorem 1. (Sample Error) Suppose that Conditions A1-A3 are satisfied. Let
$M' = \max\{2M, \|f - f_\rho\|_\infty\}$. For any $f \in \mathcal{H}_K$ and
$0 < \delta < 1$, we have, with probability at least $1 - \delta$,

$$
\mathcal{E}_S(f) - \mathcal{E}(f) + \mathcal{E}(\bar f) - \mathcal{E}_S(\bar f)
 \le 6M'L \Big( \frac{T_1(N, \delta)}{N} + \frac{T_2(N, \delta)}{N^{1/2}} \Big) + \frac{1}{N^{\frac{1}{2+s}}}, \tag{17}
$$

where

$$
T_1(N, \delta) = V(N, \delta/2) + \log(2/\delta), \qquad
T_2(N, \delta) = \sqrt{V(N, \delta/2)} + \sqrt{\log(2/\delta)}.
$$

When $\|f - f_\rho\|_\infty$ is bounded, the leading factor in (17) is
$\sqrt{V(N, \delta/2)/N}$. In that case, Theorem 1 implies that the sample error (4) has
an $O(N^{-\frac{1}{2+s}})$ bound in probability. Under our model setup, this result is
general for a broad range of continuous estimators that are bounded above.


3.2.2 Hypothesis Error Bound 

We now continue our feasibility analysis with the hypothesis error (5), which
measures the empirical risk difference between $\bar f$ and an arbitrary $f$. When DKR
is conducted with $m = 1$, $\bar f$ corresponds to the single-machine-based kernel
learning. By setting $\lambda = 0$, the hypothesis error has a natural zero bound by
definition. However, this property is no longer valid for a general DKR with $m > 1$.

When $\ell$ is convex, we have (5) bounded by

$$
\begin{aligned}
\mathcal{E}_S(\bar f) - \mathcal{E}_S(f)
 &= \frac{1}{N} \sum_{i=1}^N \ell\Big( \frac{1}{m} \sum_{j=1}^m \hat f_j, z_i \Big) - \frac{1}{N} \sum_{i=1}^N \ell(f, z_i) \\
 &\le \frac{1}{m} \sum_{j=1}^m \{ \mathcal{E}_S(\hat f_j) - \mathcal{E}_S(f) \}.
\end{aligned} \tag{18}
$$

This implies that the hypothesis error of $\bar f$ is bounded by a uniform bound of
the hypothesis errors over the $m$ sub-estimators. We formulate this idea as the
following theorem.


Theorem 2. (Hypothesis Error) Suppose that Conditions A1-A3 are satisfied. For
any $0 < \delta < 1$ and $f \in \mathcal{H}_K$, we have, with probability at least
$1 - \delta$,

$$
\mathcal{E}_S(\bar f) - \mathcal{E}_S(f)
 \le 6LM' \Big( \frac{T_1(n, \delta/2)}{n} + \frac{T_2(n, \delta/2)}{n^{1/2}} \Big)
 + \frac{1}{n^{\frac{1}{2+s}}} + 2\lambda \|f\|_K^p,
$$

where $M'$, $T_1$, and $T_2$ are defined in Theorem 1.


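Proof. Without loss of generality, we prove the theorem for $\bar f$ with $m > 1$.
Recall that DKR splits $S$ into $m$ segments $S_1, \dots, S_m$. Let $S/S_j$ be the
sample set with $S_j$ removed from $S$ and
$\mathcal{E}_Q(f) = \sum_{z_i \in Q} \ell(f, z_i)/q$ be the empirical risk for a sample
set $Q$ of size $q$. Under A2, $\ell$ is convex and thus

$$
\begin{aligned}
\mathcal{E}_S(\bar f) - \mathcal{E}_S(f)
 &\le \frac{1}{m} \sum_{j=1}^m \{ \mathcal{E}_S(\hat f_j) - \mathcal{E}_S(f) \} \\
 &= \frac{1}{m} \sum_{j=1}^m \Big[ \frac{n}{N} \big( \mathcal{E}_{S_j}(\hat f_j) - \mathcal{E}_{S_j}(f) \big)
   + \frac{N - n}{N} \big( \mathcal{E}_{S/S_j}(\hat f_j) - \mathcal{E}_{S/S_j}(f) \big) \Big] \\
 &= \frac{1}{m} \sum_{j=1}^m \Big[ \frac{n}{N} B_j + \frac{N - n}{N} U_j \Big],
\end{aligned} \tag{19}
$$

where $B_j = \mathcal{E}_{S_j}(\hat f_j) - \mathcal{E}_{S_j}(f)$ and
$U_j = \mathcal{E}_{S/S_j}(\hat f_j) - \mathcal{E}_{S/S_j}(f)$.

Let us first work on the first term of (19). By definition of $f_j$, we know that

$$
\mathcal{E}_{S_j}(\hat f_j) + \lambda \|f_j\|_K^p
 \le \mathcal{E}_{S_j}(f_j) + \lambda \|f_j\|_K^p
 \le \mathcal{E}_{S_j}(f) + \lambda \|f\|_K^p.
$$

Therefore,

$$
B_j = \mathcal{E}_{S_j}(\hat f_j) - \mathcal{E}_{S_j}(f)
 \le \lambda \|f\|_K^p - \lambda \|f_j\|_K^p \le \lambda \|f\|_K^p. \tag{20}
$$

This implies that the first term of (19) is bounded by $n\lambda \|f\|_K^p / N$.

We now turn to bound the second term of (19). Specifically, we further decompose
$U_j$ by

$$
U_j = u_{1j} + u_{2j} + u_{3j} + u_{4j} + B_j
 \le u_{1j} + u_{2j} + u_{3j} + u_{4j} + \lambda \|f\|_K^p,
$$

where

$$
\begin{aligned}
u_{1j} &= \mathcal{E}_{S/S_j}(\hat f_j) - \mathcal{E}_{S/S_j}(f_\rho) - \mathcal{E}(\hat f_j) + \mathcal{E}(f_\rho), \\
u_{2j} &= \mathcal{E}(f) - \mathcal{E}(f_\rho) - \mathcal{E}_{S/S_j}(f) + \mathcal{E}_{S/S_j}(f_\rho), \\
u_{3j} &= \mathcal{E}_{S_j}(f) - \mathcal{E}_{S_j}(f_\rho) + \mathcal{E}(f_\rho) - \mathcal{E}(f), \\
u_{4j} &= \mathcal{E}(\hat f_j) - \mathcal{E}(f_\rho) - \mathcal{E}_{S_j}(\hat f_j) + \mathcal{E}_{S_j}(f_\rho).
\end{aligned}
$$

Note that $\hat f_j$ is independent of $S/S_j$. Proposition 1 readily implies that, with
probability at least $1 - \delta$,

$$
\begin{aligned}
u_{1j} &\le 4LM \Big( \frac{\log(1/\delta)}{N - n} + \sqrt{\frac{\log(1/\delta)}{N - n}} \Big), \\
u_{2j} &\le 2L \|f - f_\rho\|_\infty \Big( \frac{\log(1/\delta)}{N - n} + \sqrt{\frac{\log(1/\delta)}{N - n}} \Big), \\
u_{3j} &\le 2L \|f - f_\rho\|_\infty \Big( \frac{\log(1/\delta)}{n} + \sqrt{\frac{\log(1/\delta)}{n}} \Big).
\end{aligned}
$$

Also, by applying Proposition 2 with $m = 1$, we have, with probability at least
$1 - \delta$,

$$
u_{4j} \le 12ML \Big( \frac{V(n, \delta)}{n} + \sqrt{\frac{V(n, \delta)}{n}} \Big) + n^{-\frac{1}{s+2}},
$$

with the same $V$ defined in Proposition 2. Consequently, we have, with probability
at least $1 - \delta$,

$$
U_j \le 6LM' \Big( \frac{V(n, \delta/4) + \log(4/\delta)}{n}
 + \frac{\sqrt{\log(4/\delta)} + \sqrt{V(n, \delta/4)}}{n^{1/2}} \Big)
 + \frac{1}{n^{\frac{1}{2+s}}} + \lambda \|f\|_K^p, \tag{21}
$$

where $M' = \max\{2M, \|f - f_\rho\|_\infty\}$.

Inequalities (20) and (21) further imply that, with probability at least $1 - \delta$,

$$
\mathcal{E}_S(\bar f) - \mathcal{E}_S(f)
 \le 6LM' \Big( \frac{T_1(n, \delta/2)}{n} + \frac{T_2(n, \delta/2)}{n^{1/2}} \Big)
 + \frac{1}{n^{\frac{1}{2+s}}} + 2\lambda \|f\|_K^p.
$$

The theorem is therefore proved. □

Theorem 2 implies that, with appropriate $f$ and $\lambda$, the hypothesis error of DKR
has an $O(n^{-\frac{1}{2+s}})$ bound in probability. This result is applicable to a
general $\bar f$ with $m > 1$, which incorporates the diverging $m$ situations.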


3.3 Generalization Bound of DKR 


With the aid of Theorems 1-2, we obtain a probability bound for the generalization
error of $\bar f$ as the following theorem.

Theorem 3. (Generalization Error) Suppose that Conditions A1-A3 are satisfied.
When $N$ is sufficiently large, for any $0 < \delta < 1$,

$$
\mathcal{E}(\bar f) - \mathcal{E}(f_\rho)
 \le 24LM \Big( \frac{T_1(n, \delta/4)}{n} + \frac{T_2(n, \delta/4)}{n^{1/2}} \Big)
 + \frac{2 + L}{n^{\frac{1}{2+s}}} + 2\lambda \|f_0\|_K^p
$$

with probability at least $1 - \delta$, where $f_0 \in \mathcal{H}_K$ and
$\|f_0 - f_\rho\|_\infty \le N^{-1}$.

Proof. Under Conditions A1 and A3, for any $N \ge 1$, there exists an
$f_0 \in \mathcal{H}_K$ such that $\|f_0 - f_\rho\|_\infty \le N^{-1}$. Under A2, this
also implies that (6) is bounded by $L/N \le L/n^{\frac{1}{2+s}}$. Clearly, when $N$ is
sufficiently large, $M' = \max(2M, \|f_0 - f_\rho\|_\infty) = 2M$. The theorem is a
direct result of applying Theorems 1-2 to (4) and (5) with $f = f_0$. □

Theorem 3 suggests that, if we set $\lambda = o(\|f_0\|_K^{-p} n^{-\frac{1}{2+s}})$, the
generalization error of $\bar f$ is bounded by an $O(n^{-\frac{1}{2+s}})$ term in
probability. In other words, as $n \to \infty$, a properly tuned DKR leads to an
estimator that achieves the oracle predictive power. This justifies the feasibility of
using the divide-and-conquer strategy for the kernel-based regression analysis. Under
the assumption that $f_\rho \in \mathcal{H}_K$, we have $f_0 = f_\rho$ and thus $\bar f$
is feasible with $\lambda = o(n^{-\frac{1}{2+s}})$. Moreover, when DKR is conducted with
Gaussian kernels, Condition A3 is satisfied with any $s > 0$ and thus
$\mathcal{E}(\bar f)$ enjoys a nearly $O_p(n^{-1/2})$ convergence rate to
$\mathcal{E}(f_\rho)$.

Theorem 3 provides theoretical support for the distributed learning framework
(Algorithm 1). It also reveals that the convergence rate of $\mathcal{E}(\bar f)$ is
related to the scale of the local sample size $n$. This seems to be reasonable, because
$\hat f_j$ is biased from $f_\rho$ under a general setup. The individual bias of
$\hat f_j$ may diminish as $n$ increases. It, however, would not be balanced off by
taking the average of the $\hat f_j$s for $j = 1, \dots, m$. As a result, the
generalization bound of $\bar f$ is determined by the largest bias among the $m$
$\hat f_j$s. When $\hat f_j$ is (nearly) unbiased, its generalization performance is
mainly affected by its variance. In that case, $\bar f$ is likely to achieve a faster
convergence rate by averaging over the $\hat f_j$s. We use the following corollary to
show some insights on this point.


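Corollary 1. Suppose that DKR is conducted with the quadratic loss and $\lambda = 0$.
If $E[\hat f_j(x) - f_\rho(x)] = 0$ for any $x \in \mathcal{X}$, then under Conditions
A1-A3, we have

$$
\mathcal{E}(\bar f) - \mathcal{E}(f_\rho) = O_p\Big( \frac{1}{m\, n^{\frac{1}{2+s}}} \Big).
$$

Proof. Let $\rho_X$ be the marginal distribution of $X$. When the quadratic loss is used,
we have

$$
\mathcal{E}(\bar f) - \mathcal{E}(f_\rho) = \|\bar f - f_\rho\|_{\rho_X}^2
 = \int_{\mathcal{X}} (\bar f(X) - f_\rho(X))^2\, d\rho_X. \tag{22}
$$

Since we assume $E[\hat f_j(x)] = f_\rho(x)$ for any $x \in \mathcal{X}$, (22) implies
that

$$
\begin{aligned}
E[\mathcal{E}(\bar f) - \mathcal{E}(f_\rho)]
 &= \int_S \int_{\mathcal{X}} (\bar f(X) - f_\rho(X))^2\, d\rho_X\, d\rho_S \\
 &= \int_{\mathcal{X}} (E[\bar f(X) - f_\rho(X)])^2\, d\rho_X + \int_{\mathcal{X}} E[\bar f(X) - f_\rho(X)]^2\, d\rho_X \\
 &= \frac{1}{m} \int_{\mathcal{X}} E[\hat f_1(X) - f_\rho(X)]^2\, d\rho_X \\
 &= \frac{1}{m} E[\mathcal{E}(\hat f_1) - \mathcal{E}(f_\rho)].
\end{aligned} \tag{23}
$$

Applying Theorem 3 with $m = 1$ and $\lambda = 0$, we have, for some generic constant
$C > 0$,

$$
P\big\{ \mathcal{E}(\hat f_1) - \mathcal{E}(f_\rho) > C \log(8/\delta)\, n^{-\frac{1}{2+s}} \big\} \le \delta. \tag{24}
$$

Let $t = C \log(8/\delta)\, n^{-\frac{1}{2+s}}$. Inequality (24) implies that

$$
\begin{aligned}
E[\mathcal{E}(\hat f_1) - \mathcal{E}(f_\rho)]
 &= \int_0^\infty P\{ \mathcal{E}(\hat f_1) - \mathcal{E}(f_\rho) > t \}\, dt \\
 &\le \int_0^\infty 8 \exp\big( -C^{-1} n^{\frac{1}{2+s}} t \big)\, dt \\
 &\le 8C n^{-\frac{1}{2+s}}.
\end{aligned}
$$

This together with (23) implies that
$E[\mathcal{E}(\bar f) - \mathcal{E}(f_\rho)] = O(m^{-1} n^{-\frac{1}{2+s}})$, which
further implies the corollary. □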


Corollary 1 is only conceptual, because it is usually difficult to construct an
unbiased $\hat f_j$ without strong prior knowledge. Nevertheless, it sheds light on
designing more efficient DKR with less biased sub-estimators. In practice, this may be
done by choosing a small $\lambda$ or using some debiasing techniques in Algorithm
1. In this paper, we focus on providing a general feasibility support for DKR and
leave this issue for future research.

It should also be noted that, under Theorem 3, DKR is feasible only when
$n \to \infty$ or, equivalently, $m = o(N)$. This means that, to have DKR work well, the
sample size on each local machine should be large enough. This seems to be a natural
condition, because in a large-$m$-small-$n$ situation, each local output $\hat f_j$ is
unlikely to provide a meaningful estimate. As a consequence, the global estimate
$\bar f$ may not be well constructed either. In real applications, an appropriate $m$
should be used such that the associated DKR achieves a good balance of algorithmic
accuracy and computational efficiency.
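
To make the role of $m$ concrete, the following Python sketch evaluates the leading
order $n^{-1/(2+s)}$ of the generalization bound with $n = N/m$. It ignores all
constants and logarithmic factors, so the numbers indicate only how the order changes
with $m$, not the actual error. The function name `rate` and the value $s = 0.1$ are
assumptions made for illustration.

```python
def rate(N, m, s):
    """Leading order n^(-1/(2+s)) of the bound, with local sample size n = N/m."""
    n = N / m
    return n ** (-1.0 / (2.0 + s))


# Illustrative setting (assumptions): N = 100,000 and a small complexity exponent s.
N, s = 100000, 0.1
orders = {m: rate(N, m, s) for m in (1, 20, 50, 100, 200, 500, 1000)}

# With m = N every machine holds a single sample and the order no longer decays.
degenerate = rate(N, N, s)
```

The order grows monotonically with $m$ and reaches 1 when $m = N$, which matches the
requirement $m = o(N)$ discussed above.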


4 Numerical Studies 

We evaluate the finite sample performance of DKR through both simulation and
real data examples. In particular, we assess the distributive strategy for several
popular regression methods in terms of both computational efficiency and
generalization capability. All numerical studies are implemented in MATLAB 8.2 on a
Windows workstation with 8-core 3.07 GHz CPUs.

4.1 Simulation 

In the simulation studies, we assess DKR on a hypothetical learning problem with
$d = 2$. Specifically, we generate independent observations based on model

$$
Y = \mathrm{sinc}(20X_1 - 10) \times \mathrm{sinc}(20X_2 - 10) + \epsilon, \tag{25}
$$

where $(X_1, X_2)$ denotes the two attributes of covariate $X$, $\epsilon$ is an
observational noise, and

$$
\mathrm{sinc}(x) = \begin{cases} \sin(x)/x, & x \ne 0, \\ 1, & x = 0. \end{cases}
$$

The values of $(X_1, X_2)$ are sampled based on a uniform distribution on
$[0,1] \times [0,1]$.

We evaluate DKR based on model (25) under two cases: (i) we set $N = 100{,}000$
and generate data with $\epsilon \sim N(0, 0.2)$; (ii) we generate $N_1 = 80{,}000$
samples with $\epsilon \sim N(0, 0.1)$ and $N_2 = 20{,}000$ samples with
$\epsilon \sim U[-2, 2]$. The second case is designed such that the data contain about
20% outliers. This setup poses further challenges for DKR in learning the relationship
between $Y$ and $X$.

Figure 3: The generalization performance of DKR in Case (i). Plot (a): RMSE for
$\bar f$ versus $m$; Plot (b): RMSE for the single-machine estimator based on $1/m$ of
the data versus $m$.

Figure 4: The generalization performance of DKR in Case (ii). Plot (a): RMSE for
$\bar f$; Plot (b): RMSE for the single-machine estimator based on $1/m$ of the data.
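
The data-generating mechanism of model (25) can be sketched in Python as follows.
The sketch makes several assumptions that are not stated in the text: the second
argument of $N(0, \cdot)$ is treated as a variance, the sample sizes are scaled down by
a factor of 100 to keep the example light, and the names `sinc`, `target`, and
`sample` are hypothetical helpers.

```python
import math
import random


def sinc(x):
    """sinc(x) = sin(x)/x for x != 0 and 1 at x = 0."""
    return 1.0 if x == 0 else math.sin(x) / x


def target(x1, x2):
    """Noise-free regression surface of model (25)."""
    return sinc(20 * x1 - 10) * sinc(20 * x2 - 10)


def sample(n, noise, rng):
    """Draw n pairs (y, (x1, x2)) with covariates uniform on [0,1] x [0,1]."""
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        data.append((target(x1, x2) + noise(rng), (x1, x2)))
    return data


rng = random.Random(0)

# Case (i): Gaussian noise only; 0.2 is read as a variance (assumption).
case_i = sample(1000, lambda r: r.gauss(0, math.sqrt(0.2)), rng)

# Case (ii): 80% Gaussian noise plus 20% uniform noise on [-2, 2] acting as outliers.
case_ii = (sample(800, lambda r: r.gauss(0, math.sqrt(0.1)), rng)
           + sample(200, lambda r: r.uniform(-2, 2), rng))
```

A test set as used later for the RMSE can be drawn with the same helper by passing a
noise function that always returns zero.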

Regarding the implementation of DKR, we set the number of partitions $m$ =
20, 50, 100, 200, 500, and 1000, so that the minimum sample size on each local
machine is 100. We set the thresholding value $M = 1$ and build the dictionary
$\mathcal{H}_K$ by the popular Gaussian kernel

$$
K(x_1, x_2) = \exp(-\|x_1 - x_2\|_2^2/\tau^2) \tag{26}
$$

with $\tau = 0.05$. In Case (i), we conduct DKR with three popular regression methods
under framework (1): ridge regression ($L_2$-loss plus $L_2$-regularization), LASSO
($L_2$-loss plus $L_1$-regularization), and SVR ($\epsilon$-insensitive loss plus
$L_2$-regularization); in Case (ii), we conduct DKR based on two robust regression
methods: LAD ($L_1$-loss plus $L_2$-regularization) and LAD-LASSO ($L_1$-loss plus
$L_1$-regularization). In our simulations, we choose the tuning parameter $\lambda$
based on a few pilot runs of DKR with $m = 20$ and use the standard MATLAB packages for
computing the corresponding regression estimators.

Table 1: Simulation results: averaged computational time of DKR in seconds.

                          m = 20     50      100     200     500      1000
Case (i)    Ridge          27.6     1.91    0.26    0.04    < 0.01   < 0.01
            LASSO          74.8     13.6    4.93    2.54    1.91     1.20
            SVR            0.94     0.25    0.09    0.04    0.02     0.01
Case (ii)   Ridge          28.1     1.94    0.26    0.04    < 0.01   < 0.01
            LAD            112      15.2    2.67    0.76    0.23     0.16
            LAD-LASSO      104      17.4    2.31    0.59    0.17     0.08

To assess the generalization capability of DKR, we generate an independent testing set $\{(y_i, x_i) : i = 1, \ldots, n_t\}$ of size $n_t = 5000$ from model (25) with $\epsilon = 0$ and compute

$$\mathrm{RMSE}(f) = \Big\{ \frac{1}{n_t} \sum_{i=1}^{n_t} |f(x_i) - y_i|^2 \Big\}^{1/2}.$$


We report the averaged RMSE of DKR for each setup based on 100 repetitions. For comparison, we also report the RMSE of the corresponding standard (non-distributive) regression method based on 1/m of the data.
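The RMSE criterion is straightforward to compute. The sketch below assumes the testing set is stored as a list of (y, x) pairs and that the estimator is a callable; the names `rmse` and `averaged_rmse` are hypothetical.

```python
import math


def rmse(predict, test_set):
    """RMSE(f) = sqrt((1 / n_t) * sum_i |f(x_i) - y_i|^2).

    `test_set` is a list of (y_i, x_i) pairs and `predict` maps x to f(x).
    """
    n_t = len(test_set)
    total = sum((predict(x) - y) ** 2 for y, x in test_set)
    return math.sqrt(total / n_t)


def averaged_rmse(rmse_values):
    """Average the RMSE over repetitions (100 in the simulations)."""
    return sum(rmse_values) / len(rmse_values)
```

For example, a predictor that always returns 0 on the testing set [(3.0, 0.0), (4.0, 1.0)] has squared errors 9 and 16, so its RMSE is the square root of 12.5.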

The simulation results are shown in Figures 3 and 4, where the associated computational cost is given in Table 1. We observe that, when m is moderate, the DKR approach performs quite well in achieving a low RMSE for all tested regression methods. This partially justifies the feasibility result obtained in this work. In our setup, choosing $m \in (50, 100)$ seems to be the most attractive, because the associated DKR estimator enjoys a strong generalization capability at a low computational cost. Clearly, by using multiple machines, DKR retains almost the same computational cost as the standard non-distributive method using only 1/m of the data. Meanwhile, with a moderate m, it significantly improves the resulting estimator over the single-machine-based local output. The framework of DKR therefore serves as a viable route for conducting efficient learning for big data.

It should also be noted that the performance of DKR may deteriorate when m is overly large. In Case (i) with m = 1000, DKR does not help much in reducing the RMSE of the single-machine-based estimator. As discussed in Section 3, this might be caused by the estimation bias and the insufficient sample size on each local machine. In principle, a smaller m helps to improve the effectiveness of DKR, but it also leads to a higher computational cost. In practice, one should conduct DKR with different choices of m and select an appropriate value based on the specific situation. A reasonable rule is to set m to the smallest value within the affordable computational budget.
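The budget rule can be illustrated with the ridge timings of Case (i) from Table 1. This is only a sketch: the function name and the budget values are hypothetical, and the entries reported as "< 0.01" are recorded as the upper bound 0.01.

```python
def smallest_m_within_budget(timings, budget_seconds):
    """Return the smallest m whose averaged time fits the budget, else None."""
    affordable = [m for m, seconds in timings.items()
                  if seconds <= budget_seconds]
    return min(affordable) if affordable else None


# Ridge, Case (i), averaged time in seconds (Table 1).
# Assumption: "< 0.01" is stored as its upper bound 0.01.
RIDGE_CASE_I = {20: 27.6, 50: 1.91, 100: 0.26, 200: 0.04, 500: 0.01, 1000: 0.01}
```

With a budget of 1 second per local fit, the rule picks m = 100; relaxing the budget to 2 seconds allows m = 50.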

DKR also inherits reasonable robustness against outliers from the associated local outputs. This is revealed by the low RMSE of the DKR estimator built on LAD and LAD-LASSO in Case (ii) with m < 50.

4.2 Real data example 

We apply DKR to analyze a real-world dataset, which contains 583,250 instances of Twitter discussions on topics related to new technology in 2013. Each instance is described by d = 77 features related to that discussion. It is of interest to predict the number of active discussions (Y) based on these features (X). To facilitate the computing process, we include only the instances with $Y \in [20, 200]$ in our analysis, which leads to a training set of size 174,507 and a testing set of size 19,390. We standardize each attribute of X such that it has zero mean and unit standard deviation. Readers may refer to Buzz Data on http://archive.ics.uci.edu/ml/datasets.html for more detailed information about this dataset.
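The two preprocessing steps can be sketched as follows. The helper names are hypothetical, rows are assumed to be stored as (x, y) pairs, and the standard deviation is assumed to be the population version (dividing by n); the text does not state which set the statistics are computed on, so the sketch simply standardizes whatever matrix it is given.

```python
import math


def filter_by_response(rows, low=20, high=200):
    """Keep the (x, y) pairs whose response lies in [low, high]."""
    return [(x, y) for x, y in rows if low <= y <= high]


def standardize_columns(X):
    """Rescale every column of X to zero mean and unit standard deviation."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n)
            for j in range(d)]
    # Constant columns are left centered but unscaled to avoid dividing by 0.
    scales = [s if s > 0 else 1.0 for s in stds]
    return [[(row[j] - means[j]) / scales[j] for j in range(d)] for row in X]
```

Standardization matters here because the Gaussian kernel depends on Euclidean distances, so features on very different scales would otherwise dominate the kernel values.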


Table 2: RMSE for the analysis of Buzz data.

m =       40    120    300    500   1000
Ridge   24.8   25.3   25.6   25.9   26.5
LASSO   24.9   25.3   25.6   26.0   26.4
LAD     25.1   25.4   25.9   26.0   26.3


Similar to our simulation studies, we build $\mathcal{H}_K$ based on the Gaussian kernel (26) with r = 10. We set m = (40, 120, 300, 500, 1000) and apply DKR to the training sample with Ridge, LASSO, and LAD. We summarize the analysis in terms of the RMSE on the testing sample, which is shown in Table 2. Like many other social media data, this dataset is known to be noisy and highly skewed. Thus, the results in Table 2 indicate a decent performance of DKR. In this example, we observe that the results are not very sensitive to the choice of m. Thus, researchers may prefer a larger m for computational convenience.


5 Conclusion 

In this paper, we studied distributed kernel regression (DKR) for learning with big data. DKR follows a divide-and-conquer framework, which enables distributive storage and parallel computing. In DKR, the performance of the global estimator is determined by a uniform bound over the distributed local estimates. Under mild conditions, we show that DKR provides a consistent estimate that leads to the oracle generalization risk. Our results offer a general theoretical support for DKR, which is applicable to a broad range of regression methods. As a first step, the current work focuses only on the feasibility of DKR. It would be important to further investigate its efficiency and develop the corresponding acceleration methods. Also, it is promising to extend the current distributive framework to other learning tasks, such as classification and variable selection. We leave all these interesting topics for future research.